Genomic prediction of resistance to Hymenoscyphus fraxineus in common ash (Fraxinus excelsior L.) populations

Abstract The increase in introduced insect pests and pathogens due to anthropogenic environmental changes has become a major concern for tree species worldwide. Common ash (Fraxinus excelsior L.) is one of such species facing a significant threat from the invasive fungal pathogen Hymenoscyphus fraxineus. Some studies have indicated that the susceptibility of ash to the pathogen is genetically determined, providing some hope for accelerated breeding programs that are aimed at increasing the resistance of ash populations. To address this challenge, we used a genomic selection strategy to identify potential genetic markers that are associated with resistance to the pathogen causing ash dieback. Through genome‐wide association studies (GWAS) of 300 common ash individuals from 30 populations across Poland (ddRAD, dataset A), we identified six significant SNP loci with a p‐value ≤1 × 10−4 associated with health status. To further evaluate the effectiveness of GWAS markers in predicting health status, we considered two genomic prediction scenarios. Firstly, we conducted cross‐validation on dataset A. Secondly, we trained markers on dataset A and tested them on dataset B, which involved whole‐genome sequencing of 20 individuals from two populations. Genomic prediction analysis revealed that the top SNPs identified via GWAS exhibited notably higher prediction accuracies compared to randomly selected SNPs, particularly with a larger number of SNPs. Cross‐validation analyses using dataset A showcased high genomic prediction accuracy, predicting tree health status with over 90% accuracy across the top SNP sets ranging from 500 to 10,000 SNPs from the GWAS datasets. However, no significant results emerged for health status when the model trained on dataset A was tested on dataset B. Our findings illuminate potential genetic markers associated with resistance to ash dieback, offering support for future breeding programs in Poland aimed at combating ash dieback and bolstering conservation efforts for this invaluable tree species.


| INTRODUC TI ON
Anthropogenic environmental changes have resulted in an increased occurrence of introduced insect pests and pathogens (Fisher et al., 2012).These changes are primarily driven by factors such as climate change (Jung, 2009;Tubby & Webber, 2010), long-distance transport (Boyd et al., 2013;Guo et al., 2012), and the global trade of living organisms (Santini et al., 2013).The rise in pests and diseases has significant impacts on native species, leading to elevated mortality rates and/or diminished overall health (Dobson, 2009).Among the affected species, long-lived organisms with slow reproduction rates, such as forest trees, are particularly vulnerable to novel pathogens.
Genomic selection (GS) is a breeding strategy that utilizes a prediction model that has been developed from the genotypic and phenotypic data of a training population, to estimate the genomic estimated breeding values (GEBVs) for individuals in a breeding population based on their genomic profiles, as proposed by Meuwissen et al. (2001).These GEBVs are then used for selecting superior individuals as parents for the next generation in breeding programs.
GS offers advantages over traditional breeding methods that rely on phenotypic selection by significantly reducing the duration of the breeding cycle and eliminating the need for extensive field testing of progeny (Heffner et al., 2010).It is especially crucial for traits that manifest late in development (particularly in long-living species) or that are challenging to evaluate accurately, such as pest and disease resistance.Despite the high genetic diversity and genotyping costs associated with forest trees, GS has shown promising results in studies involving Eucalyptus and loblolly pine (Grattapaglia & Resende, 2011;Resende, Resende et al., 2012;Resende, Muñoz et al., 2012).Recent studies focusing on stress resistance, including diseases, pests, and drought, have also reported encouraging findings regarding the prediction of complex traits in forest trees (Beaulieu et al., 2020;Bouvet et al., 2020;Lenz et al., 2020;Stocks et al., 2019;Westbrook et al., 2020).
Common ash (Fraxinus excelsior L.) is a widely distributed deciduous tree species in Europe, extending from the Mediterranean Sea to southern Scandinavia.It plays a significant ecological role in European forests (Hultberg et al., 2020), and it possesses valuable wood properties (Dobrowolska et al., 2011), making it economically important.However, the invasive fungal pathogen Hymenoscyphus fraxineus (HF) (T.Kowalski) Baral, Q ueloz, and Hosoya (Baral et al., 2014), responsible for inducing tree dieback and causing high mortality rates across Europe (Coker et al., 2019), poses a significant threat to the future of ash species.Ash dieback was first identified in the northwest of Poland in 1992(Kowalski, 2006), and it subsequently spread throughout Europe (McKinney et al., 2014;Pautasso et al., 2013).The characteristic symptoms of this disease include crown damage, with shoot, twig, and branch dieback, early leaf wilting and shedding, the formation of necrotic lesions on leaves and cambium, bark discoloration, and the sprouting of epicormic shoots (McKinney et al., 2014;Skovsgaard et al., 2017).Only a small percentage of common ash genotypes within populations, ranging from 1 to 5 percent, remain undamaged (Kjaer et al., 2012;McKinney et al., 2011McKinney et al., , 2014;;Muñoz et al., 2016;Pliūra et al., 2011;Stener, 2013).Identifying genotypes that show no damage is a research priority to facilitate the selection and conservation of ash populations in European woodlands.
Studying resistance in adult ash populations is particularly challenging.The response of adult ash trees to HF infection is complex, and the dieback symptoms may appear and worsen for many years.Therefore, the collection of tolerant and susceptible adults for GWAS studies is recommended to be conducted in populations that have been exposed to HF for an extended period, such as those in Poland.The hope is that the ash trees exhibiting a good health status should be confidently considered resilient to HF.However, it cannot be ruled out that resistant individuals were eliminated due to unfavorable environmental conditions.Conversely, favorable environments may have allowed individuals with lower genetic resistance to thrive.In Poland, where selection pressure has been ongoing for more than 30 years, the identification of genetic markers in common ash populations has not been performed yet.
Reduced representation sequencing approaches can effectively decrease genotyping costs while still enabling the detection of SNPs associated with resistance through GWAS analysis, and enhancing the efficiency of genomic selection (Iwata et al., 2016;Kumar et al., 2012;Poland et al., 2012;Varshney et al., 2005).However, the quality of GWAS studies, particularly those utilizing reduced representation sequencing, also relies on the quality of the assembly and annotation of the reference genomes.Most of the existing GWAS studies in ash have relied on the reference genome, BATG 0.5 (Sollars et al., 2017) In this study, we are using a genomic selection strategy employing an optimal panel of SNPs selected through genome-wide association studies, and we are assessing its effectiveness in predicting ash tree resistance to H. fraxineus.This approach was previously successfully conducted by Stocks et al. (2019), where genomic prediction was used to determine the health status of common ash in southeast England.This study identified 3149 SNPs associated with health status, selected from the extremes of ash dieback damage phenotypes.Next, a genomic prediction model was trained using pool-seq data from 1250 trees, based on 10,000 loci that had the most significant association with disease susceptibility in a pool-seq GWAS.These 10,000 SNPs were able to predict the disease classification of individually sequenced 150 test trees with 68% accuracy.
In our study, we initially conducted a GWAS using doubledigest restriction site-associated DNA (ddRAD) data to identify significant SNPs associated with the health status of 300 individuals of common ash from 30 populations across Poland (dataset A).
Subsequently, we evaluate the genomic prediction accuracies of the most significant GWAS marker sets in predicting health status, considering two scenarios.Firstly, we perform cross-validation on dataset A. Secondly, we train markers using dataset A and evaluate their performance on dataset B, which entails whole-genome sequencing data on 20 individuals from two distinct populations in Poland (dataset B).Additionally, we assess the genomic prediction accuracy for various health status indicators in dataset B, including binary health status, defoliation (%), tree crown vitality, trunk condition, synthetic tree damage index (Syn), and the synthetic variable SynT (the average value of Syn and the trunk condition class after the prior standardization of both variables), using different sets of SNPs.These sets comprise 100, 200, 500, 1000, 2500, 5000, and 10,000 SNPs selected based on GWAS and randomly selected SNPs for both scenarios.Finally, we propose a set of genetic markers and a geographical region that could potentially be supported by breeding programs aimed at enhancing the resistance of ash trees.

| Plant materials
This study was based on two datasets.The first one (dataset A) included 300 common ash individuals from 30 populations (10 individuals per population) in Poland (Figure S1, Table S1).The second (dataset B) consisted of 20 individuals from the Drygały and Grodzisk populations (10 individuals per population), each belonging to one of the two evolutionary lineages of F. excelsior revealed by chloroplast DNA (Heuertz, Fineschi, et al., 2004;Heuertz, Hausman, et al., 2004;Meger et al., 2024) (Figure 1, Table S1).
We selected populations with a high natural infection pressure by H. fraxineus, where apart from the affected trees, there were also trees that did not exhibit any visible disease symptoms.Within each population, consisting of 40 to 50 individuals, five healthy and five diseased trees were selected.An exception was observed in the Drygały and Grodzisk populations, where we found only four and two healthy trees, respectively, while the remaining ones were diseased.All of the analyzed trees were over 70 years old.Fresh leaves were collected in June/July 2017 for dataset A, and in July 2022 for dataset B.

| Phenotypic data
The phenotypic data for both datasets are provided in Table S1.
Each sampled individual from both datasets was rated for their ash dieback damage according to two groups: 0 for a healthy tree with no visible damage by H. fraxineus, and 1 for a strongly damaged tree.
For 20 trees from dataset B, defoliation, tree crown vitality, and trunk condition were additionally assessed.The defoliation of trees was estimated as the percentage of leaf canopy loss from visual expectation from the ground.Defoliation was assessed in steps of 5% increments, starting from 0%, up to 95%.Trees with 100% defoliation were not considered.
The crown vitality was assessed according to Roloff's classification (Roloff, 1988(Roloff, , 2001)), based on the lengths and types of shoots growing in the top of the crown.Trees were classified into four groups of damage: • Class 0 -a live, undamaged tree -the crown shape is spherical without gaps, with long shoots dominating; symmetrical shoots occur, and multi-year branches have a "fan" shape (long shoots come from the axis of the main top shoots and the side branches).
• Class 1 -weakened tree -the upper part of the crown has gaps between the branches of the "lance" type (the main axes of the shoots form shortened long shoots, and the side branches are short shoots), while the inside of the crown is dense.
• Class 2 -damaged tree -clear thinning of the upper and middle parts of the crown, "claw" type branches (the main axes of the top shoots only produce short shoots, forming bending "chains of short shoots" over the years), and single thick branches.
• Class 3 -strongly damaged, dying tree -the crown shape is irregular, with the presence of thick and dying branches, and "claws." The synthetic tree damage index (Dmyterko & Bruchwald, 1998) for each tree was calculated from the combination of defoliation and vitality, according to the following formula: where Syn, synthetic tree damage index; Def, tree defoliation (%); Vit, vitality.
The condition of the trunk was assessed, based on the following classification of three groups: 0 -for a healthy trunk with no visible damage, 1 -for single necrotic lesions at the base of the trunk, 2for visible trunk rot.
The last index of assessing tree damage (total synthetic tree damage -SynT) was created from the average value of the synthetic tree damage class (Syn) and the trunk condition class after the prior F I G U R E 1 Geographical distribution of the studied populations of common ash in Poland.
standardization of both variables.An increase in the SynT value indicates an increase in the overall level of damage to trees.

| DNA extraction and sequencing
Total DNA extraction from 20 individuals from dataset B was based on the method described by Wang et al. (2012).The quality of the isolated genomic DNA was checked using both a Quantus Fluorometer (Promega) and gel electrophoresis.Library preparation using Truseq PCR-free with 350 bp inserts, and whole-genome sequencing using NovaSeq6000 with 150 bp paired-end reads was performed by Macrogen (Amsterdam, The Netherlands).
Raw read sequences from ddRAD (dataset A) were processed for adapter removal using Cutadapt software (Martin, 2011); contaminants (e.g., chloroplast genome sequences) were filtered with the package ERNE-FILTER (Del Fabbro et al., 2013).The raw sequence reads obtained from whole-genome sequencing (dataset B) underwent a trimming process to eliminate low-quality segments and adapters.This trimming procedure utilized Trimmomatic-0.33 (Bolger et al., 2014) and involved quality trimming using a sliding window with a window size of four base pairs and a quality threshold of 15.Any reads with a size less than 36 base pairs were excluded from further analysis.
The reads from both datasets were then aligned to the F. excelsior genome assembly FRAX_001_PL (https:// www.ncbi.nlm.nih. gov/ data-hub/ genome/ GCA_ 01909 7785.1/ ) using the Burrows -Wheeler Aligner (BWA -MEM algorithm) with the default settings (Li & Durbin, 2009).SAM files were converted to BAM files, and these were sorted and indexed using SAMtools v.0.1.19(Li et al., 2009).VCFtools (Danecek et al., 2011) to include only biallelic SNPs that were present in at least 95% of individuals, with a minimum depth of 10 reads and a minor allele frequency >5%.SNPs from dataset A were then pruned for linkage disequilibrium (LD), using PLINK 2.0 (Purcell et al., 2007).A cutoff of r 2 > 0.5 was used for removing SNPs showing strong signals of LD.

| Genetic structure
The genetic structures of the analyzed populations in datasets A and B (a total of 32 populations) were assessed using two distinct methods.First, we conducted principal component analysis (PCA) with the R package SNPRelate version 1.32.1 (Zheng et al., 2012) and visualized the outcomes with ggplot2 version 3.4.0(Wickham, 2016).
Second, we applied the Bayesian model-based clustering method with fastSTRUCTURE version 1.0 software (Raj et al., 2014).Our analysis involved testing for K values ranging from 1 to 32 clusters, with 10 iterations for each run.The optimal K value was determined using the chooseK tool within the package.Visualization of the best results was performed using Clumpak software (Kopelman et al., 2015).

| Genome-wide association study
The genome-wide association study (GWAS) for the health status of dataset A employed the mixed linear model (MLM) method (Yu et al., 2006).This method was implemented through the "GWAS" function in the R package rrBLUP version 4.6.1 (Endelman, 2011).
Both the kinship matrix (K) and the first four principal components (PCs) were incorporated into the MLM model to address relatedness among the genotypes and to avoid false positives.The kinship matrix was generated using the "A.mat" function from the R package rrB-LUP version 4.6.1 (Endelman, 2011).To correct for multiple testing, the method proposed by Li and Ji (2005) was utilized.
Given that health status is a quantitative trait influenced by numerous genetic loci, the Bonferroni test criterion (0.05/number of SNPs) is overly stringent for use as a threshold.As no SNP reached the Bonferroni threshold, the threshold value of p-value ≤1 × 10 −4 was arbitrarily set for the following reasons: Firstly, the relatively small sample size may lead to insufficient statistical power.Secondly, the objective of this analysis is to identify the most significant loci associated with health status, which are crucial for subsequent genomic prediction analyses.Manhattan plots visualizing GWAS results were generated using the R package qqman ver.0.1.8(Turner, 2014).

| Genomic prediction
To evaluate the best genomic prediction accuracy, we employed five regression models, utilizing all SNPs available in dataset A.
The dataset was divided into training and testing populations at a 60/40 ratio for assessment.The regression models included genomic best linear unbiased prediction (GBLUP) with ridge kernel regression (RR) or Gaussian kernel regression (GAUSS), Bayesian Ridge Regression (BRR), Bayesian Lasso (BL), and BayesB.GBLUP with RR and GAUSS were implemented using the "kinship.BLUP" function in the R package rrBLUP version 4.6.1 (Endelman, 2011), while Bayesian linear regression models (BRR, BL, and BayesB) were executed using the R package BGLR ver.1.0.363.Prediction accuracies (r) were estimated using 500 replicates of ten-fold cross-validation, with r computed as the mean Pearson's correlation coefficient between the genomic estimated breeding value (GEBV) and the observed health status.
To assess the effectiveness of GWAS-associated SNPs and validate their potential, we considered two genomic prediction scenarios.Firstly, cross-validation was conducted on dataset A with a training and testing population ratio of 60/40, involving subsets of 100, 200, 500, 1000, 2500, 5000, and 10,000 SNPs selected based on the most significant GWAS results.These results were then compared to sets of randomly selected SNPs from the genome, with each set having an equal number of SNPs as the original set being tested.Missing SNP data were imputed using the "A.mat" function from the R package rrBLUP version 4.6.(Endelman, 2011).
Prediction accuracy was determined as the mean Pearson correlation between GEBV and health status across the 500 replications.

Individuals within the validation population were categorized into
high-and low-susceptibility groups based on their GEBV.Healthy individuals were assigned GEBV values above the median, while diseased individuals were assigned values below the median.The accuracy of this classification was assessed using the formula: f = correct assignments/total assignments, where correct assignments referred to those that aligned with the observed phenotypes.
In the second scenario, genomic prediction models were trained on dataset A and tested on dataset B. The same subsets of SNPs from the previous scenario with the most significant GWAS results and randomly selected SNPs were used for the training and testing populations.To enhance the robustness and reliability of the analysis, the random selection of SNP datasets was repeated 10 times.It's noteworthy that a successful genotyping rate of over 90% was achieved for SNPs from dataset A in dataset B. The health status vector for each individual, denoted as y, was predicted using the rrBLUP model as: y = Xβ + ε, where β represents a vector of the marker effects and Var[ε] is the residual variance.
The design matrix X contains the genotype information, where each row represents an individual and each column represents an SNP.Missing marker data were imputed using the "A.mat" function from the R package rrBLUP version 4.6.1 (Endelman, 2011).
The effect sizes of each considered SNP (β) were estimated using the "mixed.solve"function in the rrBLUP ver.4.6.1 package (Endelman, 2011).Genomic estimated breeding values (GEBV) for each tree in the validation population (dataset B) were calculated using the effect sizes obtained from dataset A. GEBVs were computed as the sum of the additive genetic effects of the SNPs found in dataset B. Trees within the validation population were categorized into high and low susceptibility groups as described above.
The relationships between the GEBVs and the tree phenotypes (health status, defoliation, vitality, trunk condition, Syn, and SynT) were calculated using the Pearson correlation coefficient.In addition, for the random sets, prediction accuracy was determined as the mean Pearson correlation between GEBV and the health status indicators across the 10 sets for each size of SNP.

| Sequencing and genotyping data
Raw data and mapping statistics are provided in Table S2.
Approximately 1.99 and 178 million raw sequence reads per individual were generated for datasets A and B, respectively.Over 97% of reads for both datasets were mapped to the F. excelsior genome assembly.
After mapping the reads to the reference genome, over 2.5 million SNPs were detected in dataset A, while dataset B revealed 56 million SNPs.The SNP loci identified in this study were distributed on all 23 chromosomes (Table S3).However, after filtering, 28,817 loci for dataset A and 7,882,379 for dataset B remained, representing high-confidence SNPs.
The frequency of unfiltered SNPs was 3.28 SNPs per 1 kilobase pair (kbp) for dataset A, and 70.98 SNPs/kbp for dataset B. After filtering, the frequency of SNPs was reduced to 0.04 and 10.50 SNPs/kbp for datasets A and B, respectively (Table S3).In dataset A, chromosome 1 had the highest number of unfiltered SNPs (181,986).However, for filtered SNPs, chromosome 2 had the highest number of SNPs (2022).In dataset B, chromosome 1 had the most SNPs for both unfiltered (3,733,633) and filtered (521,245) SNPs.
For dataset A, the average percentage of missing data per individual was 8.31%.Due to the high missing data ratio, eight individuals were removed from further analyses.For dataset B, the average missing data ratio per individual was 3.68%.

| Genetic structure
The final dataset of SNPs from dataset A was compared to dataset B, and it was found that 26,757 out of 28,817 SNPs from dataset A were variable within dataset B. After applying additional filtering based on a MAF >0.05, a total of 26,134 SNPs were retained for the analysis of the genetic structure of all specimens analyzed in both datasets.
The variances explained by the first two PCA components were 0.0365 and 0.0294, respectively, indicating no clear genetic structure (Figure 2).The individuals form one large group, with outliers from the Drygały and Wołów populations.However, the fastSTRUCTURE function "chooseK" indicated K = 2 as the optimal number of clusters.The graphical representation of the membership coefficients per individual to the clusters is presented in Figure S1.
The majority of individuals in the analyzed dataset were assigned to the first group (blue), while individuals from the Drygały, Jawor, and Wołów populations, which were outliers in PCA, were classified as belonging to the second group (orange).

| GWAS
The GWAS analysis detected six SNP loci, showing a significant association with health status in dataset A, with the p-values ranging from 1.55 × 10 −5 to 9.79 × 10 −5 (Figure 3, Table S4).However, it is important to note that these markers are situated in non-coding regions of the genome.While a relatively lenient threshold was applied, it is essential to emphasize that the primary aim of these findings is to nominate loci for subsequent genomic prediction analyses.

| Genomic prediction
We evaluated the accuracy of the GP model for predicting the health status of individuals in dataset A using all 28,817 SNPs and five regression models previously described.The ten-fold cross-validations (CVs) showed r values ranging from 0.082 to 0.120 (Figure 4), but none of them were statistically significant.Among the models, GBLUP (RR) demonstrated superiority (r = 0.12; p-value = 0.027) in predicting the health status of individuals in dataset A (Figure 4), and this was used for further analysis.
The results of cross-validation conducted on dataset A using various SNP datasets are presented in Figure 5a,b.The genomic prediction accuracy (Figure 5a) was significantly high and improved as the number of SNPs from the GWAS datasets increased.For the 2500 SNP dataset, the prediction accuracy reached r = 0.91 (p-value = 2.2 × 10 −16 ) and continued to increase, reaching r = 0.92 (p-value = 2.2 × 10 −16 ) in the 10,000 SNP dataset.Furthermore, accurate prediction of tree health status (phenotype assignment) exceeded 0.9 across SNP sets ranging from 500 to 10,000 SNPs (Figure 5b).
In contrast, no significant results were obtained when training the model on dataset A and testing its performance on dataset B (Figure 5b,c).Prediction accuracy increased with the number of SNPs, reaching a maximum value of r = 0.30 (p-value = 0.16) for the 10,000 SNP dataset (Figure 5c).However, the correct assignment of health status was only 0.6 for datasets ranging from 1000 to 10,000 SNPs (Figure 5d).It is worth noting, that this was only slightly higher than the results achieved using randomly selected SNP sets of the same size.The estimated effect sizes for SNPs in sets of sizes 100, 200, 500, 1000, 2500, 5000, and 10,000 obtained in crossvalidation in dataset A are shown in Tables S5-S11.
The results showing the correlation of GEBV in dataset B with the remaining health status indicators (defoliation, vitality, Syn, trunk condition, and SynT) are presented in Figure S2.Generally, prediction accuracy was higher when utilizing the top SNPs selected by GWAS compared to randomly selected SNPs, particularly with larger SNP numbers (Figure S2).However, there were exceptions for trunk condition and the SynT index, where prediction accuracy was higher for sets with a smaller number of loci.Notably, significant prediction accuracy was observed with the top 2500 SNPs (r = 0.45; p-value = 0.045), 5000 SNPs (r = 0.48; p-value = 0.033), and 10,000 SNPs (r = 0.51; p-value = 0.027) for defoliation.A similar pattern was visible for Syn, although the genomic prediction indicator was not statistically significant.Furthermore, significantly high prediction accuracy was achieved with 100 SNPs (r = 0.60; p-value = 0.006) and 200 SNPs (r = 0.62; p-value = 0.004).However, this pattern contradicts theoretical predictions and introduces uncertainty into our analysis.Therefore, we abstain from interpretation and acknowledge the necessity for further investigation to accurately elucidate its implications.The estimated effect sizes for SNPs in sets of sizes 2500, 5000, and 10,000 obtained in training the whole dataset A are provided in Tables S12-S14.

| DISCUSS ION
The increasing occurrence of introduced insect pests and pathogens due to anthropogenic environmental changes poses a significant threat to tree species worldwide, including common ash (Fraxinus excelsior L.).Previous genetic studies have demonstrated the heritability and polygenic nature of resistance to Hymenoscyphus fraxineus in common ash, suggesting that these trees may respond well to genomic selection, enabling the breeding or evolution of durable increased resistance.Therefore, GWAS and GP appear to be the most suitable tools for studying and identifying genetic markers that are associated with this complex trait.In this study, we assessed the capability of GWAS and GP to identify potential genetic markers that are associated with resistance to ash dieback.

F I G U R E 4
Genomic prediction accuracy for five predictive models (GBLUP (RR), GBLUP (GAUSS), BRR, LASSO, BayesB) using 28,817 SNPs to predict the health status of individuals in dataset A. Error bars represent the means ± standard error (based on 500 replicates of ten-fold cross-validation).

| Genetic structure
The genetic structure analysis based on PCA revealed that the first two principal components explained a relatively low amount of variance (PC1: 0.0365; PC2: 0.0294), suggesting that there is no clear genetic structure among the individuals (Figure 2).The relatively low differentiation between common ash populations at the nuclear genome may be attributed to pollen-mediated homogenizing gene flow (Heuertz, Fineschi, et al., 2004;Heuertz, Hausman, et al., 2004).
However, further examination using the fastSTRUCTURE software showed that K = 2 was the optimal number of clusters, indicating the presence of two distinct groups within the analyzed dataset.The observed genetic differentiation between the second group of populations (Drygały, Jawor and Wołów) and the rest of the individuals may reflect the postglacial colonization of Poland from the refugia located in the eastern Alps and the Balkans.It is expected that the southwestern populations of ash originate from the refugium in the eastern Alps, while the northeastern populations originate from the Balkan refugium (Heuertz, Fineschi, et al., 2004;Heuertz, Hausman, et al., 2004).These findings emphasize the significance of considering local genetic diversity and structure when formulating strategies for resistance breeding and conservation initiatives in common ash populations.

| Genetic association
The GWAS analysis allowed us to identify six SNP loci strongly associated (p-value ≤1 × 10 −4 ) with the health status of common ash, as depicted in Figure 3 and listed in Table S4.However, the existing genetic structure of the common ash population can lead to falsepositive associations in GWAS analysis (Korte & Farlow, 2013).To address this issue, we applied a correction method by incorporating the first four principal components (PCs) along with a relative kinship matrix (K) in our GWAS analysis.This approach allowed us to account for the population structure and relatedness among individuals, reducing the risk of spurious associations and increasing the accuracy of our genetic association findings.
Previous research using associative transcriptomics has also identified a small number of immune-related genes associated with increased resistance to H. fraxineus in Danish resistance trials.This includes one SNP based on cDNA and two gene expression markers (Harper et al., 2016).Further analysis of this data revealed an additional 20 gene expression markers associated with disease symptoms (Sollars et al., 2017), although none of the genes identified in this study corresponded to those previously found in immunerelated genes.Furthermore, the comprehensive sequencing of 1250 trees and a subsequent genome-wide association study led to the identification of an impressive 3149 SNPs that were significantly associated with disease resistance (Stocks et al., 2019).Intriguingly, out of the 192 most significant SNPs, 61 were associated with genes that have homologs that are known to be involved in plant-pathogen interactions (Stocks et al., 2019).
The low number of genes related to ash dieback found in this study could be attributed to several factors.One possible reason is the use of ddRAD data, which represents only a small fraction of the genome, making it impossible to identify all the genes.Another possible reason is the complexity of the trait being studied.Disease resistance is a multifaceted trait influenced by various genetic and environmental factors, making it challenging to identify all the genes involved.
Moreover, the sample size and statistical power of the GWAS could influence the number of identified genes.With a small sample size, the study may lack sufficient statistical power to detect all of the relevant genetic associations, leading to a lower number of genes being identified.Furthermore, the presence of population-specific adaptations and environmental interactions may additionally complicate the identification of genes associated with disease resistance.

| Genomic prediction
Genomic selection presents a promising avenue for enhancing disease resistance in ash trees within breeding programs.It is widely recognized that genomic prediction accuracy tends to increase with higher marker density (Elshire et al., 2011;Zhang et al., 2019).
However, considerations regarding the cost of genotyping and phenotyping are paramount, as they can significantly impact the feasibility of large-scale implementation.In our study, we opted for ddRAD data for genotyping, a cost-effective and efficient method for identifying SNPs across the genome.Additionally, we adopted a genomic selection strategy where a relatively low number of SNPs selected by GWAS can replace a high density of SNPs without compromising prediction accuracy.
Among the five genomic prediction methods tested, GBLUP (RR) exhibited higher prediction accuracy compared to other models in predicting the health status of individuals in dataset A. However, it's important to note that the difference in accuracy was not statistically significant.Nonetheless, this underscores the potential of the GBLUP (RR) model for accurate phenotypic prediction, aligning with its recognized robust performance under low marker density conditions (Habier et al., 2007).Furthermore, our findings align with earlier studies in maize (Gowda et al., 2015;Zhao et al., 2012), further reinforcing the credibility and applicability of the GBLUP (RR) model across different plant species.
Both genomic prediction scenarios applied in this study demonstrated that utilizing the top-ranked SNPs identified through GWAS yielded higher prediction accuracies for disease resistance traits compared to randomly selected SNPs.This suggests that the loci influencing resistance to ash dieback, as identified through GWAS, played a significant role in enhancing prediction accuracies, even with low-density SNP panels.By pinpointing these key genetic markers associated with disease resistance, GWAS provides valuable insights for targeted breeding programs aimed at improving ash tree resilience to ash dieback.Previous studies across various plant species have also highlighted the effectiveness of leveraging GWAS information to improve genomic selection strategies (Spindel et al., 2016;Stocks et al., 2019;Zhang et al., 2014).Our findings further underscore the importance of integrating GWAS findings into genomic prediction models to enhance breeding efforts for disease resistance in trees.
The levels of prediction accuracy in our study are relatively high and are comparable to those observed in genomic selection studies of other plant species (Biazzi et al., 2017;Grinberg et al., 2016;Müller et al., 2017;Resende, Resende et al., 2012;Resende, Muñoz et al., 2012;Slavov et al., 2014;Spindel et al., 2016;Stocks et al., 2019;Yamashita et al., 2020).Cross-validation analyses using dataset A demonstrated high genomic prediction accuracy, with correlation coefficients (r) surpassing 0.9 for datasets ranging from 2500 to 10,000 SNPs.Moreover, predictions of tree health status (f) achieved an accuracy exceeding 0.9 across SNP sets ranging from 500 to 10,000 SNPs from the GWAS datasets.However, when the model trained on dataset A was tested on dataset B, no significant results were observed for health status prediction.The maximum prediction accuracy reached a value of r = 0.30 (p-value = 0.16) for the 10,000 SNP dataset, while the correct assignment of health status achieved an accuracy of f = 0.6 for datasets ranging from 1000 to 10,000 SNPs.In genomic prediction scenarios where populations from the test set are not included in model training, prediction accuracy may decrease due to genetic and environmental differences between populations, influencing the model's effectiveness in predicting phenotypic traits.
In contrast, significant statistical results were obtained in a study conducted by Stocks et al. (2019) where genomic prediction was used to determine the health status of common ash in southeast England.They demonstrated significant predictions of ash tree This suggests that defoliation may serve as a superior indicator of tree health, possibly more directly linked to the tree's response to fungal infection causing ash dieback.
A common limitation of genomic prediction models is their high population specificity (Müller et al., 2017;Wientjes et al., 2013).

| CON CLUS IONS
The present study demonstrates the potential of a genomic selection strategy for enhancing disease resistance in ash trees within breeding programs.We found that utilizing a relatively low number of SNPs selected through genome-wide association studies (GWAS) can yield high prediction accuracies, highlighting the efficiency of this approach.Cross-validation analyses showcased high genomic prediction accuracy, predicting tree health status with over 90% accuracy across the top SNP sets ranging from 500 to 10,000 SNPs from the GWAS datasets.Our findings illuminate the potential for expediting the breeding of ash dieback-resistant trees in southwestern Poland.To advance the promising prospects of genomic selection, future endeavors should prioritize the expansion of training populations and the implementation of whole-genome sequencing.
These steps hold the potential for further bolstering the effectiveness of genomic selection, ultimately contributing to the preservation and fortification of common ash populations in the face of pressing threats, such as ash dieback.
DNA was extracted from the young leaves of 300 sampled individuals from dataset A using the GeneMATRIX Plant & Fungi DNA Purification Kit (EURx, Poland), following the manufacturer's instructions.Extracted DNA was quantified using the Eppendorf BioPhotometer and Quantus Fluorometer (Promega).Genomic DNA was digested with SphI and MboI.Library preparation and doubledigest restriction site-associated DNA (ddRAD) sequencing using an Illumina HiSeq 2500 platform with 125 bp paired-end reads were performed at IGA Technology Services (Udine, Italy).
GATK 3.5 (DePristo et al., 2011)  was used to detect SNPs in each aligned sample for both datasets separately, using a minimum Phred-scaled confidence threshold of 30.Subsequently, the GATK tools "VariantFiltration" and "SelectVariants" were employed to eliminate low-quality variants.The specific criteria for exclusion included: QD < 20.0, MQ < 40.0, MQRankSum < −12.5, and ReadPosRankSum < −8.0.The filtering of SNPs was performed using Principal component analysis plot of PC1 and PC2, based on 26,757 SNPs scored in 32 populations of common ash.The populations considered as outliers are marked with a triangle.F I G U R E 3 Manhattan plot of the results from the GWAS for health status in dataset A. Blue line indicates the threshold of the pvalue ≤1 × 10 −4 .

F I G U R E 5
Genomic prediction accuracy and phenotype assignment results using the cross-validation model conducted on dataset A (a, b) and training the model on dataset A and testing its performance on dataset B (c, d) to predict health status of common ash.In panels (a, b), error bars represent the means ± standard error (based on 500 replicates of ten-fold cross-validation).In panels (c, d), error bars for random datasets represent the means ± standard error (based on the selection of random SNP datasets performed 10 times).
health using pool-seq data from 1250 trees for training and testing on 150 individuals, which were not part of the DNA pools.In this scenario, they achieved the highest accuracy with a correlation coefficient (r) of observed scores and GEBV of r = 0.35, along with a frequency of correct allocations of f = 0.67.The additional correlations of GEBV in dataset B with other health status indicators revealed notable prediction accuracies for defoliation across specific SNP sets: 2500 SNPs (r = 0.45; pvalue = 0.045), 5000 SNPs (r = 0.48; p-value = 0.033), and 10,000 SNPs (r = 0.51; p-value = 0.027), selected based on GWAS findings.
Hence, our results hold the potential for expediting the breeding of ash dieback-resistant trees in southwestern Poland (the region covered by dataset A).Unfortunately, widespread alleles involved in ash dieback resistance across all analyzed populations in Poland were not identified.Notably, in the second genomic prediction scenario, only 20 samples were used for testing, potentially limiting the scope and generalizability of our findings.Further research with larger sample sizes and diverse populations is crucial to fully elucidate the genetic basis of ash dieback resistance across different regions in Poland.Moreover, leveraging whole-genome sequencing can lead to a higher marker density, facilitating the discovery of additional genetic variants associated with disease resistance.
. However, we recently published the new reference genome of F. excelsior, accessible at (https:// www.ncbi.